Lessons Learned Building an Index of the WWW

Last week I gave the keynote presentation at SMX Munich, Lessons Learned Building an Index of the WWW. In that presentation, I shared a great deal of data from our web index as well as some SEO tips based on our experience replicating many search engine activities (crawling, indexing, building a link graph, de-duplication, canonicalization, etc.). In this blog post, I’d like to first announce that Linkscape’s new index, with crawl data from late March to early April (& upon which these data points are calculated), is now live – check it out here – and second, to share the charts, graphs and tips from my presentation.

The Linkscape Index

First off, some basic points about Linkscape’s index:

  • The crawl is intended to imitate what major search engines crawl and keep in their index. Talking to lots of folks from the engines who do this work, we’ve heard that while tens or hundreds of billions of pages are crawled, there are only “~5-10 billion pages worth keeping in a main index.”
  • Linkscape is a crawler-built index, meaning it uses a seed set and crawls outward via links to discover new URLs (a simplified sketch of this process follows the list below).
  • The index currently biases towards pages with external links, meaning we don’t crawl as deeply as the major engines do, but we try to crawl very broadly (to reach as many well-connected pages and unique domains as possible).
  • The crawlers and data sources we currently employ all respect robots.txt.
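
To make the “seed set, then crawl outward” idea concrete, here’s a minimal sketch of a breadth-first, robots.txt-respecting crawl. This is not Linkscape’s actual code; the fetch_page and extract_links helpers are assumed placeholders for an HTTP fetcher and an HTML link extractor.

```python
import urllib.robotparser
from collections import deque
from urllib.parse import urljoin, urlparse

def crawl(seed_urls, fetch_page, extract_links, max_pages=1000):
    """Breadth-first crawl outward from a seed set, skipping robots.txt-disallowed URLs."""
    robots_cache = {}                    # one robots.txt parser per host
    seen = set(seed_urls)
    frontier = deque(seed_urls)
    index = {}

    while frontier and len(index) < max_pages:
        url = frontier.popleft()
        parts = urlparse(url)
        host = parts.netloc

        # Fetch and cache robots.txt once per host.
        if host not in robots_cache:
            rp = urllib.robotparser.RobotFileParser()
            rp.set_url(f"{parts.scheme or 'http'}://{host}/robots.txt")
            try:
                rp.read()
            except OSError:
                pass                      # unreachable robots.txt leaves the parser conservative
            robots_cache[host] = rp
        if not robots_cache[host].can_fetch("*", url):
            continue

        html = fetch_page(url)            # assumed helper: returns HTML or None
        if html is None:
            continue
        index[url] = html

        # Follow links outward to discover new URLs.
        for href in extract_links(html):  # assumed helper: yields hrefs found in the HTML
            absolute = urljoin(url, href)
            if absolute not in seen:
                seen.add(absolute)
                frontier.append(absolute)
    return index
```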

The Web’s Structure

As we crawl, we see some well-known structural pieces making up the web:

Web's Data Structure

Linkscape, as well as numerous academic sources (and, almost certainly, the major search engines), collect and store data about three types of structural components – pages, subdomains and root domains. Link & content metrics, along with crawl parameters and query-independent ranking factors, are stored about each of these.
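
As a rough illustration of that three-level model (the field names here are assumptions for the sake of the example, not our actual schema), the stored components might look along these lines:

```python
from dataclasses import dataclass, field

@dataclass
class Page:
    url: str
    mozrank: float = 0.0             # page-level link popularity
    external_links_in: int = 0
    http_status: int = 200

@dataclass
class Subdomain:
    name: str                         # e.g. "www.seomoz.org"
    pages: list = field(default_factory=list)
    subdomain_mozrank: float = 0.0

@dataclass
class RootDomain:
    name: str                         # e.g. "seomoz.org"
    subdomains: list = field(default_factory=list)
    domain_mozrank: float = 0.0
    domain_moztrust: float = 0.0
    linking_root_domains: int = 0
```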

Linkscape also sees a view of the web that most IR students will be familiar with:

Bowtie Visualization of the Web's Linking Structure

As others have noted in the past, the web’s link structure tends to look a bit like a bowtie, with a large number of tightly linked, well-connected pages in the center and outliers on the borders with few incoming/outbound links. Linkscape does a relatively good job with the center and the linked-to edge (pages with few/no outbound links), but struggles more with pages that have no incoming links (as these are difficult to discover and often not worth keeping in an index).

Index Statistics

We’ve found these data points fascinating and I’m excited to be able to share many of them for the first time. While Linkscape is not as comprehensive as Yahoo! or Google, it’s far closer to a full representation of the web than a small sample. Our latest index update currently contains:

  • 44,410,893,857 (44 Billion) pages
  • 230,211,915 (230 Million) subdomains
  • 54,712,427 (54 Million) root domains
  • 474,779,069,489 (474 Billion) links

For this index, the following breakdowns apply:

Page Response Codes

Distribution of Subdomains

Distribution of Pages

Distribution of Links

* Note that for the link distribution chart, this refers to “external, juice-passing” links, which excludes links from a subdomain to itself as well as links on pages with meta nofollow or those that employ rel=nofollow.
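
For illustration, a simplified version of that “external, juice-passing” test might look like the following. This is a sketch, not Linkscape’s internal logic:

```python
from urllib.parse import urlparse

def is_external_juice_passing(source_url, target_url, page_meta_nofollow, link_rel):
    """True only if the link crosses subdomains and isn't nofollowed at either level."""
    source_host = urlparse(source_url).netloc.lower()
    target_host = urlparse(target_url).netloc.lower()
    if source_host == target_host:               # same subdomain: internal, excluded
        return False
    if page_meta_nofollow:                       # page carries <meta name="robots" content="nofollow">
        return False
    if "nofollow" in (link_rel or "").lower():   # link carries rel="nofollow"
        return False
    return True
```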

Distribution of Linking Root Domains

* Note that for the root domains linking chart, this refers only to pages/sites receiving links from unique root domains. For example, with www.seomoz.org, we’d only count one “linking root domain” from searchengineland.com, even though that site links to ours on many unique pages. Likewise with links we receive from About.com and its numerous subdomains – in total, they count as only one “unique root domain.”
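
Here’s a sketch of that counting rule, using a deliberately naive root-domain extraction (real code needs a public-suffix list to handle multi-part TLDs like .co.uk):

```python
from urllib.parse import urlparse

def root_domain(url):
    # Naive: last two host labels; real code needs a public-suffix list (.co.uk etc.).
    return ".".join(urlparse(url).netloc.lower().split(".")[-2:])

def count_linking_root_domains(inbound_links):
    """inbound_links: iterable of (source_url, target_url) pairs."""
    domains_per_target = {}
    for source, target in inbound_links:
        domains_per_target.setdefault(target, set()).add(root_domain(source))
    return {target: len(domains) for target, domains in domains_per_target.items()}
```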

Common Link Attributes

* Not surprisingly, most links on the web are incestuous to some degree: internal links (those on the same subdomain as the target), links from the same IP address (where multiple sites from the same owner are hosted), the same root domain, or the same C-block of IP addresses. If we can see these relationships with Linkscape, it follows that the search engines have an easy time of it as well – and these links are almost certainly not passing the same kind of value that external links from unique root domains, IP addresses and C-blocks would.
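
For illustration, here’s a sketch of those relationship checks; two IPv4 addresses share a C-block when their first three octets match. The resolve_ip helper is an assumed stand-in for a DNS lookup, not part of any described API.

```python
from urllib.parse import urlparse

def link_relationship(source_url, target_url, resolve_ip):
    """resolve_ip: assumed helper mapping a hostname to its IPv4 address string."""
    src_host = urlparse(source_url).netloc.lower()
    tgt_host = urlparse(target_url).netloc.lower()
    src_ip, tgt_ip = resolve_ip(src_host), resolve_ip(tgt_host)
    return {
        "same_subdomain": src_host == tgt_host,
        # Naive root-domain comparison; real code needs a public-suffix list.
        "same_root_domain": src_host.split(".")[-2:] == tgt_host.split(".")[-2:],
        "same_ip": src_ip == tgt_ip,
        "same_c_block": src_ip.split(".")[:3] == tgt_ip.split(".")[:3],   # first three octets match
    }
```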

Uncommon Link Attributes

Some interesting data points on the above:

  • 2.7% of all links on the web are nofollowed
  • 73% of those are internal (so nofollow is actually far more popular as a link sculpting tool than a spam prevention device)
  • 3 billion out of our 475 billion links (~0.6%) were found in noscript tags – while the engines recommend against this and talk about it as a spam tactic, we suspect that many of these are, in fact, legitimate uses and probably do get counted (due to their value in content discovery).
  • 165,638,731 links (0.034%) aren’t visible on the page (they’re hidden off screen using CSS or other tactics). Again, given the numbers, we wonder whether all of these are spam and whether they’re all discounted by the engines.
  • This is our first index supporting the canonical URL tag, and so far we’ve seen just north of 16 million pages employing the parameter. While this is still a drop in the bucket on a global web scale, we’ll be watching closely for how much support it generates over the months to come.
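
As an illustration of how a crawler might record these attributes while parsing a page (a toy sketch, not how Linkscape’s parser actually works):

```python
from html.parser import HTMLParser

class LinkAttributeParser(HTMLParser):
    """Records rel="nofollow" links, a meta robots nofollow, and the canonical URL tag."""
    def __init__(self):
        super().__init__()
        self.canonical_url = None
        self.meta_nofollow = False
        self.followed_links = []
        self.nofollowed_links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        rel = (attrs.get("rel") or "").lower()
        if tag == "link" and "canonical" in rel:
            self.canonical_url = attrs.get("href")
        elif tag == "meta" and (attrs.get("name") or "").lower() == "robots":
            if "nofollow" in (attrs.get("content") or "").lower():
                self.meta_nofollow = True
        elif tag == "a" and attrs.get("href"):
            bucket = self.nofollowed_links if "nofollow" in rel else self.followed_links
            bucket.append(attrs["href"])
```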

Search Engine & Linkscape Metrics

Like the search engines, we calculate a number of metrics on the pages, subdomains and root domains in our index to help uncover spam and sort by popularity & trustworthiness. The following are distributions of the metrics we currently employ:

Distribution of mozRank

* mozRank is our calculation of raw link popularity. Like Google’s PageRank, Yahoo!’s WebRank and Live’s StaticRank, it’s a recursive algorithm that counts links as votes and treats links from more popular pages as more important. We’ve found that while it’s useful for discovering which pages to crawl and index, it’s a poor measure of true importance and has significant noise.
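
For readers who want to see the shape of such a calculation, here’s a textbook PageRank-style power iteration. It only illustrates the “links as votes, weighted by the voter’s popularity” idea; the actual mozRank formula differs in its details and isn’t published.

```python
def link_popularity(graph, damping=0.85, iterations=30):
    """graph: dict mapping each page to the list of pages it links to."""
    pages = list(graph)
    rank = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / len(pages) for p in pages}
        for page, outlinks in graph.items():
            if not outlinks:
                continue                       # dangling page: nothing to redistribute in this sketch
            share = damping * rank[page] / len(outlinks)
            for target in outlinks:
                if target in new_rank:
                    new_rank[target] += share  # a link is a vote, weighted by the voter's own rank
        rank = new_rank
    return rank
```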

Distribution of Domain-Level mozRank

* Domain mozRank is calculated in the same fashion as page-level mozRank, but on the domain-level link graph. Thus, it only takes into account unique links that exist from one root domain to another and is agnostic as to whether a site has 1, 100 or 1,000 links to another. We’ve found this metric exceptionally valuable for identifying the popularity and importance of a root domain – on the subdomain link graph, it’s more susceptible to manipulation and spam.
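
Here’s a sketch of that collapse from the page-level graph to the domain-level graph: any number of page-to-page links between two root domains becomes a single edge. The root_domain helper below is deliberately naive.

```python
from urllib.parse import urlparse

def root_domain(url):
    # Naive: last two host labels; a real implementation uses a public-suffix list.
    return ".".join(urlparse(url).netloc.lower().split(".")[-2:])

def domain_link_graph(page_links):
    """page_links: iterable of (source_url, target_url) pairs from the page-level graph."""
    edges = set()
    for source, target in page_links:
        src_dom, tgt_dom = root_domain(source), root_domain(target)
        if src_dom != tgt_dom:
            edges.add((src_dom, tgt_dom))      # 1, 100 or 1,000 links collapse to one edge
    graph = {}
    for src_dom, tgt_dom in edges:
        graph.setdefault(src_dom, []).append(tgt_dom)
    return graph                               # usable with the same kind of iteration shown above
```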

Distribution of Domain-Level mozTrust

* mozTrust, which we also calculate on both the domain and page level link graphs, has proven highly effective as a spam identifier (particularly in combination with mozRank – the difference between the two is an excellent predictor of manipulative linking). mozTrust relies on the same intuition as Yahoo!’s TrustRank, running a recursive algorithm that passes juice down from trusted seed URLs/domains.
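
To illustrate the seed-biased idea, here’s a sketch in the style of the power iteration above: the only change is that the “teleport” portion of the score flows back to a hand-picked set of trusted seeds rather than to every node equally, so trust decays with link distance from those seeds. Again, this is an illustration, not the mozTrust formula.

```python
def seed_biased_rank(graph, trusted_seeds, damping=0.85, iterations=30):
    """graph: dict mapping each page/domain to the list of nodes it links to."""
    nodes = list(graph)
    seeds = set(trusted_seeds) & set(nodes)
    base = {n: ((1.0 - damping) / len(seeds) if n in seeds else 0.0) for n in nodes}
    rank = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in nodes}
    for _ in range(iterations):
        new_rank = dict(base)                       # teleport mass goes only to trusted seeds
        for node, outlinks in graph.items():
            if not outlinks:
                continue
            share = damping * rank[node] / len(outlinks)
            for target in outlinks:
                if target in new_rank:
                    new_rank[target] += share       # trust flows outward along links
        rank = new_rank
    return rank
```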

Measuring Correlation

Possibly the most interesting data I shared from an SEO application standpoint was around our research into the correlation of individual metrics to search engine rankings. Our own Ben Hendrickson has been doing significant data gathering and analysis, trying to answer the question,

How well does any single metric predict higher rankings?

His early results are enlightening:

Correlation of Metrics with Google Rankings n+1

In this chart, Ben’s showing that no metric is particularly good at predicting rankings by itself, but if you had to use something, the number of root domains linking to a URL and that URL’s mozRank are both just above the 95% confidence interval. Note that such classic SEO metrics as Yahoo! link counts and Alexa.com counts (which are included in many toolbars and appear in many SEO reports) are very nearly worthless.

Correlation of Metrics with Google Rankings n+10

The results are much better (though still not excellent) when we instead ask which metrics correlate with ranking 10 positions higher (essentially, the difference between page 1 vs. 2, 2 vs. 3, etc.). Here, Ben shows that while only a single metric is above the 95% confidence interval (domains linking to a URL), there are several that are 20%+ better than random guessing.

Perhaps the most surprising result of this (for me, at least) was the data showing that Google’s link counts actually do have a correlation with rankings, suggesting that they’re not completely random (even though they might feel that way given their small sample size).

Out of all the metrics, it’s little surprise that # of linking root domains is a favorite (we use it, for example, to sort our Top 500 list). It’s one of the most difficult metrics to manipulate effectively and has high correlation with trust, importance and search engine rankings.
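
For readers who want to try something similar, here’s a sketch of one way such a correlation could be measured: Spearman’s rank correlation between a candidate metric and position, averaged across many result sets. This is an assumed methodology for illustration, not a description of Ben’s actual analysis, and it presumes SciPy is available.

```python
from scipy.stats import spearmanr   # assumes SciPy is installed

def mean_metric_rank_correlation(serps, metric):
    """serps: list of result lists, each ordered by Google position (rank 1 first).
    metric: function mapping a result URL to the metric value being tested."""
    correlations = []
    for results in serps:
        positions = list(range(1, len(results) + 1))
        values = [metric(url) for url in results]
        rho, _ = spearmanr(values, positions)
        correlations.append(-rho)   # negate so "higher metric at better positions" reads as positive
    return sum(correlations) / len(correlations)
```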

Top Tips for SEOs

Based on the work we do crawling and building an index, and the struggles we’ve encountered (and seen the engines similarly encounter), we’ve crafted a few short tips. While some of these are obvious and well known, it still pays to keep them in mind as high-level recommendations we feel confident the search engines would support:

  1. Don’t rely on the search engine to canonicalize anything for you (see the canonicalization sketch after this list).
  2. Focus on link acquisition from a diverse number of root domains, not necessarily high PageRank pages, or those with high link counts.
  3. Make smart, usable, short URLs. They’re far easier to process and have a much better correlation with useful, unique content an engine would want to keep in its index.
  4. If you want to earn lots of links, building a distributed content widget/badge/link that users embed in their sites/pages is an incredibly effective strategy. Just look at how many of the top pages on the web achieved that position employing this strategy.
  5. Don’t rely on PageRank or raw link counts as accurate assessments of ranking potential. According to our data, they’re neither high-signal metrics nor strongly correlated with rankings.
  6. The social web is rising, as are those employing it effectively (again, check out the top sites list for evidence).
  7. Don’t be afraid to use nofollow internally as it’s clearly not an outlier on the web. However, do be cautious with its use – you can seriously screw things up if you make mistakes on that front.
  8. Keep content on a single subdomain and root domain wherever possible. The metrics of that domain will go a long way to make that content visible and ranking-worthy.
  9. Avoid doing “strange” things from a technical and link acquisition perspective. The former makes you harder to crawl, process and index while the latter makes you stand out as possible spam/manipulation.
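
For tip #1, here’s a minimal sketch of the kind of canonicalization a site should handle itself (via 301 redirects or the canonical tag) rather than hoping an engine resolves it. The normalization rules below are illustrative choices, not a complete or universal list.

```python
from urllib.parse import urlparse, urlunparse

def canonical_form(url):
    """Return one preferred form of a URL; everything else should 301-redirect to it."""
    parts = urlparse(url.strip())
    host = parts.netloc.lower()
    if host.startswith("www."):               # pick one: with or without www
        host = host[4:]
    path = parts.path or "/"
    if path != "/" and path.endswith("/"):
        path = path.rstrip("/")                # pick one: trailing slash or not
    # Fragments are dropped; query strings are dropped here for simplicity, but
    # real sites must keep parameters that actually change the page content.
    return urlunparse((parts.scheme.lower() or "http", host, path, "", "", ""))
```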

We hope you enjoy this data – please feel free to share – and enjoy using the new Linkscape index. Again, I’d like to give my congratulations and thanks to both Ben & Nick, who’ve done a tremendous job with Linkscape. If you have questions, please leave them in the comments and they should be able to provide answers and direction.

p.s. For those keeping track, this index update was almost exactly a month from our last one, and our goal is to maintain approximately 3-4 week intervals between updates for the foreseeable future. We’re also doing a lot to improve the quality and focus of our index to capture more good stuff and deep stuff on mid-size and large domains (and less spam). We’d appreciate it if those of you who are producing lots of spam would help us out by ceasing to earn links from trustworthy, respectable sites and pages – thanks! 🙂
